The US Department of Transport Bureau of Statistics publishes Flight on-time monthly data the dataset primarly show the timeliness of flights, their origin and destimation details. Further details about the dataset can be found at https://www.transtats.bts.gov/Fields.asp The data is avaiable at https://www.transtats.bts.gov/DL_SelectFields.asp for public to download.
For this Analysis let us cosider latest data available, that is first 2 quarters of 2019. Jan to Jun 2019 data has been downloaded in 6 seperate comma seperated file.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
df_jan = pd.read_csv('T_ONTIME_REPORTING_Jan_2019.csv',
engine='c')
print(df_jan.describe(include='all'))
pd.options.display.max_columns = None
display(df_jan.sample(5))
df_jan[df_jan['DIV_REACHED_DEST'].notnull()].sample(5)
df_jan[df_jan['DIVERTED']!= 0].sample(10)
df_feb = pd.read_csv('T_ONTIME_REPORTING_Feb_2019.csv',
engine='c')
df_mar = pd.read_csv('T_ONTIME_REPORTING_Mar_2019.csv',
engine='c')
df_apr = pd.read_csv('T_ONTIME_REPORTING_Apr_2019.csv',
engine='c')
df_may = pd.read_csv('T_ONTIME_REPORTING_May_2019.csv',
engine='c')
df_jun = pd.read_csv('T_ONTIME_REPORTING_Jun_2019.csv',
engine='c')
# Concatinate all frames to 1 dataframe
frames = [df_jan, df_feb, df_mar, df_apr, df_may, df_jun]
df_H12019 = pd.concat(frames, sort=True)
df_H12019.info()
Let us get the data for rest of the months
The dataset is has main sections of data with data fields related to the section
- Time Period
- Year, Quater, Month, DayoftheMonth, DayoftheWeek, FlighDate
- Airline
- Detailed Airlines company and Filght
- Origin
- Details of originating City and Airport
- Destination
- Details of Destination City and Airport
- Departure Performance
- Metrics on depature time and delay
- Arrival Performance
- Metrics on Arrival time and delay
- Cancellation and Diversions
- Indicators of cancelled and divereted flights
- Flight Summary
- Metrics on planned and Actual flight time, distances etc.
- Cause of Delay
- Gate return to origin Aiprport
- Diverted flight details
Further to the data fields are dervided into bin certain field like
- Delay Time block - The hour duration of the day like 1900-1959
- Delay Time Group - Blocks of every 15 min , 0-15 is 0 and 0-30 is 1 and so on, to indicate the sevirity of the delay.
- Distant Group - The blocks of 250 kms is indicated as distant group. first 250 kms is 1 and upto 500 kms is 2 and so on.
- Delay in minutes due to each Catergories of delay is shown seperately to show all causes of delay
The focus of the analysis is to find out what are the primary cause of day, In term of Day of the week , time of the day , routes causing delays
By general sense we could say weather could cause the most delay follow by NAS ( Natioanal Air Saftey ) technical reasons. We could see by further doing the analysis
# We dont want both diverted and cancelled flights
df_ndely = df_H12019.query('CANCELLED != 1 & DIVERTED != 1')
df_ndely.sample(5)
fig, ax = plt.subplots(nrows=2, figsize=[15,10])
sb.distplot(df_ndely['DEP_DELAY_NEW'],
ax=ax[0])
sb.distplot(df_ndely['ARR_DELAY_NEW'],
ax=ax[1])
fig, ax = plt.subplots(nrows=2, figsize=[8,8])
var = ['DEP_DELAY_GROUP', 'ARR_DELAY_GROUP']
for i in range(len(var)):
ax[i].hist(data=df_ndely, x=var[i])
ax[i].set_xlabel('{} in n * 15 min blocks'.format(var[i]))
ax[i].set_xticks(np.arange(-2,14,1))
plt.show()
fig, ax = plt.subplots(nrows=2, figsize=[8,8])
var = ['DEP_DELAY_GROUP', 'ARR_DELAY_GROUP']
for i in range(len(var)):
ax[i].hist(data=df_ndely, x=var[i])
ax[i].set_xlabel('{} in n * 15 min blocks'.format(var[i]))
ax[i].set_xlim((1,13))
ax[i].set_ylim((0,400000))
ax[i].set_xticks(np.arange(1,14,1))
plt.show()
t_bins = np.arange(min(df_ndely[var[i]]), max(df_ndely[var[i]])+1.0, 1.0)
len(t_bins)
fig, ax = plt.subplots(nrows=2, figsize=[15,8])
var = ['DEP_DELAY', 'ARR_DELAY']
for i in range(len(var)):
ax[i].hist(data=df_ndely, x=var[i], bins = t_bins,
)
ax[i].set_xlabel('{} in minutes'.format(var[i]))
ax[i].set_xlim(-60, 250)
ax[i].set_xticks(np.arange(-60,250,10))
# ax[i].axvline(df_ndely[var[i]].value_counts().index[0],c='r')
plt.show()
df_dely = df_H12019.query('DEP_DELAY_NEW > 0 | ARR_DELAY_NEW > 0')
fig, ax = plt.subplots(nrows=1, figsize=[8,5])
var = 'DAY_OF_WEEK'
ax.hist(data=df_dely, x=var,
width = 0.5
)
ax.set_xlabel('Day of the Week'.format(var))
ax.set_xticklabels(['x', 'Sun','Mon','Tue','Wed','Thu','Fri','Sat']) #, [1,2,3,4,5,6,7])
ax.axvline(df_ndely[var].value_counts().index[0],c='r')
plt.show()
df_dely = df_H12019.query('DEP_DEL15 > 0 | ARR_DEL15 > 0')
fig, ax = plt.subplots(nrows=1, figsize=[8,5])
var = 'DAY_OF_WEEK'
ax.hist(data=df_dely, x=var,
width = 0.5,
align='mid'
)
ax.set_xlabel('Day of the Week'.format(var))
ax.set_xticklabels(['x', 'Sun','Mon','Tue','Wed','Thu','Fri','Sat']) #, [1,2,3,4,5,6,7])
ax.axvline(df_dely[var].value_counts().index[0],c='r')
plt.show()
df_Ddely = df_H12019.query('DEP_DELAY_NEW > 0')
x = df_Ddely['DEP_TIME_BLK'].unique() #.sort()
x.sort()
sb.countplot(data=df_Ddely, x='DEP_TIME_BLK',
color='g',
order=x,
);
plt.xticks(rotation=90);
df_Adely = df_H12019.query('ARR_DELAY_NEW > 0')
x = df_Adely['ARR_TIME_BLK'].unique() #.sort()
x.sort()
sb.countplot(data=df_dely, x='ARR_TIME_BLK',
color='g',
order=x,
);
plt.xticks(rotation=90);
df_ddely = df_H12019.query('ARR_DEL15 == 0 and DEP_DEL15 > 0')
df_adely = df_H12019.query('DEP_DEL15 == 0 and ARR_DEL15 > 0 ')
df_xdely = df_H12019.query('DEP_DEL15 > 0 and ARR_DEL15 > 0 ')
x = df_ddely[['CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']].sum()
x
y = df_adely[['CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']].sum()
y
z = df_xdely[['CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']].sum()
z
x.plot(kind='bar', title='Total Departure delay with Arrival Delay < 15 minutes by Categories')
y.plot(kind='bar', title='Total Arrival delay mins with < 15 min departure Delay minutes by Categories')
z.plot(kind='bar', title='Total delay minutes by Categories when both departure delay and arrival delay > 15 mins')
Delay Time Group is left skewed similarly both Depture Delay time and Arrvial departure time is left skewed. We could observer that most of the flights departed 5 mins earlier and arrived 10 mins sooner. But there is long tail of arrival delay steadly declining to 120 mins. But there are outliners on arrival delays.
Most of the flights departs on time and interestingly 5 to 1 minitue earlier . There is gradual decline in the number of depatures from 0 mintures to 110 minitues and bulk of the departure delay minitues are in in this range.Week of the Day Wednesdays and Thursdays found to have most delays followed by Saturdays and Sundays. Fridays have least delays
Time of the Day block Departure delays are almost evently spread across between 1000 hrs to 2100 hrs, Most number of delays are between evening 5 pm to 6 pm. elay are see towards end of the day between 4 pm to 10 pm. The least delay observed in morning arrival flights
Delay Categories There are no arrival delay beyound 15 mins delays are not attribute to any of the delay catergories. The NAS delay is extremly high in case of arrival delay.
The fligts which are cancelled and Diverted are ommited from analysis as are bound to cause huge delays and could be outliers. Only flight which reached their destinations are included in the analysis.
Both arrival and departure delay mintues are fairly normally distributed and does not require any transformations.
plt.figure(figsize=(30,30));
vars=['ACTUAL_ELAPSED_TIME',
'AIR_TIME',
'ARR_DELAY',
# 'ARR_DELAY_NEW',
# 'ARR_TIME',
# 'CRS_ARR_TIME',
# 'CRS_DEP_TIME',
'CRS_ELAPSED_TIME',
'DAY_OF_WEEK',
'DEP_DELAY',
# 'DEP_DELAY_NEW',
'DEP_TIME',
# 'DEST',
# 'DEST_CITY_MARKET_ID',
'DISTANCE',
# 'FIRST_DEP_TIME',
# 'FLIGHTS',
'MONTH',
# 'OP_CARRIER',
# 'OP_CARRIER_AIRLINE_ID',
# 'ORIGIN',
# 'ORIGIN_CITY_MARKET_ID',
'QUARTER',
'TAXI_IN',
'TAXI_OUT',
'WHEELS_OFF',
'WHEELS_ON']
sb.pairplot(data=df_ndely.sample(1000),
vars=vars,
plot_kws={'alpha':1/3}
)
plt.figure(figsize = [18, 15]);
sb.heatmap(df_ndely[vars].corr(), annot = True, fmt = '.3f',
cmap = 'viridis', center = 0)
plt.show();
plt.figure(figsize = [18, 15]);
vars=[ 'ARR_DELAY_NEW',
'ARR_TIME',
'CRS_ARR_TIME',
'CRS_DEP_TIME',
'CRS_ELAPSED_TIME',
'DEP_DELAY_NEW',
'DEST',
'DEST_CITY_MARKET_ID',
'FIRST_DEP_TIME',
'MONTH',
'OP_CARRIER',
'OP_CARRIER_AIRLINE_ID',
'ORIGIN',
'ORIGIN_CITY_MARKET_ID']
sb.heatmap(df_ndely[vars].corr(), annot = True, fmt = '.3f',
cmap = 'viridis', center = 0)
plt.show();
plt.figure(figsize = [13, 10]);
vars=[ 'ARR_DELAY_NEW',
'DEP_DELAY_NEW',
'LATE_AIRCRAFT_DELAY',
'NAS_DELAY',
'CARRIER_DELAY',
'WEATHER_DELAY',
'SECURITY_DELAY',
]
sb.heatmap(df_ndely[vars].corr(), annot = True, fmt = '.3f',
cmap = 'viridis', center = 0)
plt.show();
The above heat map should, Depature and Arrival delays are highly related. Carrier delay and Late Aircraft delays and Over all delays are fairly related.¶
a = df_H12019.groupby(by='DEST')['ARR_DELAY'].sum().sort_values(ascending=False)
plt.figure(figsize=(15,7))
sb.barplot(x=a.index, y=a.values, order=a.index)
plt.xticks(rotation=90);
b = a[a > 100000]
plt.figure(figsize=(15,7))
sb.barplot(x=b.index, y=b.values, order=b.index)
plt.xticks(rotation=90);
plt.title('US Airports causing more the 100k minutes or more arrival delay in H12019');
plt.ylabel('Total Minitues');
plt.xlabel('Destination Airport');
c = df_H12019[df_H12019['DEST'].isin(b.index.values)]
d = b.index.values
d
plt.figure(figsize=(15,12))
sb.stripplot(data=c, x='DEST', y='ARR_DELAY_NEW',
order=d,
)
plt.yticks(np.arange(0,c['ARR_DELAY_NEW'].max(),30));
plt.xticks(rotation=90);
plt.title('General distribution of delay minutes for US Airports causing total delay over 100k minutes in H1 2019');
colour = sb.color_palette()[4]
plt.figure(figsize=(15,10))
sb.violinplot(data=c, x='DEST', y='ARR_DELAY_NEW', order=d)
plt.yticks(np.arange(0,c['ARR_DELAY_NEW'].max(),30));
plt.xticks(rotation=90);
df_60 = c[c['ARR_DELAY_NEW'] <= 60 ]
plt.figure(figsize=(15,8))
sb.pointplot(data=df_60,
x='DEST',
y='ARR_DELAY_NEW',
order=d,
estimator=np.mean,
ci='sd',
# notch=True,
# fliersize=1/3,
)
plt.yticks(np.arange(0,df_60['ARR_DELAY_NEW'].max(),5));
plt.xticks(rotation=90);
plt.title('Arrival Delay in US airport within 60 minutes');
df_GT60 = c[c['ARR_DELAY_NEW'] > 60 ]
plt.figure(figsize=(15,8))
sb.pointplot(data=df_GT60,
x='DEST',
y='ARR_DELAY_NEW',
order=d,
estimator=np.mean,
ci='sd',
# notch=True,
# fliersize=1/3,
)
plt.yticks(np.arange(0,df_GT60['ARR_DELAY_NEW'].max(),60));
plt.xticks(rotation=90);
plt.title('Mean Arrival Delay in US airport greater than 60 minutes');
i = df_H12019.groupby(by='OP_UNIQUE_CARRIER')['DEP_DELAY'].sum().sort_values(ascending=False)
plt.figure(figsize=(15,7))
sb.barplot(x=i.index, y=i.values, order=i.index)
plt.xticks(rotation=90);
plt.title('Total Delay by Carrier');
j = df_H12019.groupby(by='OP_UNIQUE_CARRIER')['DEP_DELAY'].mean().sort_values(ascending=False)
plt.figure(figsize=(15,7))
sb.barplot(x=j.index, y=j.values, order=j.index)
plt.xticks(rotation=90);
plt.title('Mean Delay by Carrier');
k = df_H12019
k['ROUTE'] = k['ORIGIN']+'-'+k['DEST']
k1 = k.groupby(by='ROUTE')['DEP_DELAY_NEW'].sum().sort_values(ascending=False)
k2 = k.groupby(by='ROUTE')['ARR_DELAY_NEW'].sum().sort_values(ascending=False)
k3 = k.groupby(by='ROUTE')['DEP_DELAY_NEW'].mean().sort_values(ascending=False)
k4 = k.groupby(by='ROUTE')['ARR_DELAY_NEW'].mean().sort_values(ascending=False)
k1 = k1[k1>50000]
k2 = k2[k2>50000]
fig, ax = plt.subplots(nrows=2, figsize=[15,10])
sb.barplot(x=k1.index,
y=k1.values,
color='g',
order=k1.index,
ax=ax[0]
);
ax[0].set_xlabel('AIR routes')
ax[0].set_ylabel('Total Delay minutes')
xtl = ax[0].get_xticklabels()
ax[0].set_xticklabels(xtl, rotation=90)
ax[0].set_title('Total Departure Delay over 500k minutes on Air routes ')
sb.barplot(x=k2.index,
y=k2.values,
color='g',
order=k2.index,
ax=ax[1]
);
plt.xlabel('Air Routes')
ax[1].set_ylabel('Total Delay minutes')
xtl = ax[1].get_xticklabels()
ax[1].set_xticklabels(xtl, rotation=90)
ax[1].set_title('Total Arrival Delay over 500k minutes on Air routes ')
plt.show();
k3 = k3[k3 > 30.0]
k4 = k4[k4 > 30.0]
fig, ax = plt.subplots(nrows=2, figsize=[15,10])
sb.barplot(x=k3.index,
y=k3.values,
color='g',
order=k3.index,
ax=ax[0]
);
ax[0].set_xlabel('AIR routes')
ax[0].set_ylabel('Mean Delay minutes')
xtl = ax[0].get_xticklabels()
ax[0].set_xticklabels(xtl, rotation=90)
sb.barplot(x=k4.index,
y=k4.values,
color='g',
order=k4.index,
ax=ax[1]
);
plt.xlabel('Air Routes')
ax[1].set_ylabel('Mean Delay minutes')
xtl = ax[1].get_xticklabels()
ax[1].set_xticklabels(xtl, rotation=90)
plt.show();
k5 = k3[k3 > 60.0]
k6 = k4[k4 > 60.0]
fig, ax = plt.subplots(nrows=2, figsize=[15,10])
sb.barplot(x=k5.index,
y=k5.values,
color='g',
order=k5.index,
ax=ax[0]
);
ax[0].set_xlabel('AIR routes')
ax[0].set_ylabel('Mean Delay minutes')
xtl = ax[0].get_xticklabels()
ax[0].set_xticklabels(xtl, rotation=90)
sb.barplot(x=k6.index,
y=k6.values,
color='g',
order=k6.index,
ax=ax[1]
);
plt.xlabel('Air Routes')
ax[1].set_ylabel('Mean Delay minutes')
xtl = ax[1].get_xticklabels()
ax[1].set_xticklabels(xtl, rotation=90)
plt.show();
df_flnum = df_H12019
df_flnum['FL_NUM'] = df_flnum['OP_UNIQUE_CARRIER'] + df_flnum['OP_CARRIER_FL_NUM'].astype(str)
flnum1 = df_flnum.groupby(by='FL_NUM')['DEP_DELAY_NEW'].mean().sort_values(ascending=False)
flnum2 = df_flnum.groupby(by='FL_NUM')['ARR_DELAY_NEW'].mean().sort_values(ascending=False)
fig, ax = plt.subplots(nrows=2, figsize=[20,12])
sb.barplot(x=flnum1[:50].index,
y=flnum1[:50].values,
color='g',
order=flnum1[:50].index,
ax=ax[0]
);
ax[0].set_xlabel('Flight Numbers')
ax[0].set_ylabel('mean Departure Delay minutes')
xtl = ax[0].get_xticklabels()
ax[0].set_xticklabels(xtl, rotation=90)
ax[0].set_title('Top 50 Mean dealy minutes')
sb.barplot(x=flnum2[:50].index,
y=flnum2[:50].values,
color='g',
order=flnum2[:50].index,
ax=ax[1]
);
plt.xlabel('Flight Numbers')
ax[1].set_ylabel('Mean Arrival Delay minutes')
xtl = ax[1].get_xticklabels()
ax[1].set_xticklabels(xtl, rotation=90)
plt.show();
tlnum1 = df_H12019.groupby(by='TAIL_NUM')['DEP_DELAY_NEW'].mean().sort_values(ascending=False)
tlnum2 = df_H12019.groupby(by='TAIL_NUM')['ARR_DELAY_NEW'].mean().sort_values(ascending=False)
fig, ax = plt.subplots(nrows=2, figsize=[20,12])
sb.barplot(x=tlnum1[:50].index,
y=tlnum1[:50].values,
color='g',
order=tlnum1[:50].index,
ax=ax[0]
);
ax[0].set_xlabel('Flight Tail Numbers')
ax[0].set_ylabel('mean Departure Delay minutes')
xtl = ax[0].get_xticklabels()
ax[0].set_xticklabels(xtl, rotation=90)
ax[0].set_title('Top 50 Mean dealy minutes')
sb.barplot(x=tlnum2[:50].index,
y=tlnum2[:50].values,
color='g',
order=tlnum2[:50].index,
ax=ax[1]
);
plt.xlabel('Flight Tail Numbers')
ax[1].set_ylabel('Mean Arrival Delay minutes')
xtl = ax[1].get_xticklabels()
ax[1].set_xticklabels(xtl, rotation=90)
plt.show();
Airport vs Arrival Delay On average 5 minutes delay for airports with more than 100k mnutes total delay and delay is less that 1 hours. When the delay is more than 1 hour the average delay is 180 minutes.
Airlines vs Departure delay EVA airlines, and Jetblue airlines are worse performing airlines in term of average delay .They are at 17.5 minutes delay on average.
Airport Routes - Total delays is shown in Chiago to Newyork and LA to SFO routes because of relatively large number of flighys, The large mean delay is seen in minor airports and Minor routes in Florida and Phenix
Flight Number - There are few flight numbers especially flight OO4711 is greatest delay causing flights other than other private jets
Tail Numbers - There are specific Air crafts causing delays are N684RW, N656YX, N2341U, N728AN
Few AIRPORT, Routes, Aircrafts, Flight number are causing most delays. There are also some days and time of the day cusing delays. This will be investigated further in Multivariate analysis.
First Delay by Week days , time of the day vs Cause of delay. We have observed most delays in SUN, WED THU SAT and time of the day between 1500 to 2159 for Depature dealy and 1600 to 2359 for Arrival Delay¶
# Select only most delay causing days SUN, WED, THUs , SAT
df_w = df_H12019[df_H12019['DAY_OF_WEEK'].isin([1,4,5,7])]
# Select flight where the is 15 minutes or more arrival or departure delays
df_w = df_w.query('DEP_DELAY_GROUP > 0.0 | ARR_DELAY_GROUP > 0.0')
# Select more than flights with more than 15 mins arrival and departure delay.
df_wd = df_w.query('DEP_DEL15 > 0 and ARR_DEL15 > 0 ')
#select flights with departed in the hour blocks when most delays are seen ( 1500 - 2159)
df_wd = df_wd[df_wd['DEP_TIME_BLK'].isin(['1500-1559','1600-1659','1700-1759','1800-1859','1900-1959','2000-2059','2100-2159'])]
#Select flight which has no departure delays but more than 15 mins arrival delays
df_wa = df_w.query('DEP_DEL15 == 0 and ARR_DEL15 > 0')
#select flights in time block where most arrival delays are observed 1600 - 2359
df_wa = df_wa[df_wa['ARR_TIME_BLK'].isin(['1600-1659','1700-1759','1800-1859','1900-1959','2000-2059','2100-2159','2200-2259','2300-2359'])]
#
del df_w
# create faceted heat maps Day of the week Time of the day and cause for delay for departures
g = sb.FacetGrid(data = df_wd,
col = 'DEP_TIME_BLK',
row = 'DAY_OF_WEEK',
size = 3,
margin_titles = True,
)
plt.xticks(rotation='vertical')
g.map(plt.bar,
x=df_wd[['LATE_AIRCRAFT_DELAY', 'NAS_DELAY','CARRIER_DELAY','WEATHER_DELAY', 'SECURITY_DELAY']].mean().sort_values(ascending=False).index,
height=df_wd[['LATE_AIRCRAFT_DELAY', 'NAS_DELAY','CARRIER_DELAY','WEATHER_DELAY', 'SECURITY_DELAY']].mean().sort_values(ascending=False).values,
color = 'g', )
g.set_xlabels('Delay Categories')
g.set_ylabels('Mean Delay Minutes')
plt.show()
# Get only records which has Late Aircraft Delay catergory
df_wd1 = df_wd[df_wd['LATE_AIRCRAFT_DELAY'] > 0.0 ]
df_wd1.sample()
df_wd1.groupby(by='OP_UNIQUE_CARRIER')['LATE_AIRCRAFT_DELAY'].mean().sort_values(ascending=False).index
sb.catplot(data = df_wd1,
x='OP_UNIQUE_CARRIER',
y='LATE_AIRCRAFT_DELAY',
hue='DAY_OF_WEEK',
kind='bar',
height=10,
aspect=2,
palette = 'Blues',
estimator=np.mean,
order=df_wd1.groupby(by='OP_UNIQUE_CARRIER')['LATE_AIRCRAFT_DELAY'].mean().sort_values(ascending=False).index,
)
plt.suptitle('Airlines Causing most departure delays due Late Aircraft arrival by Week day')
plt.xlabel('Airline Code');
plt.ylabel('Mean Delay in Minutes');
plt.show();
df_wd2 = df_wd1[df_wd1['OP_UNIQUE_CARRIER'].isin(['YV','UA'])].sort_values(by='LATE_AIRCRAFT_DELAY',ascending=False)[:50]
df_wd2.sample()
df_wd2.groupby(by='TAIL_NUM')['LATE_AIRCRAFT_DELAY'].mean().sort_values(ascending=False).index[:50]
sb.catplot(data = df_wd2,
x='TAIL_NUM',
y='LATE_AIRCRAFT_DELAY',
hue='OP_UNIQUE_CARRIER',
kind='bar',
height=10,
aspect=2,
palette = 'Greens',
order=df_wd2.groupby(by='TAIL_NUM')['LATE_AIRCRAFT_DELAY'].mean().sort_values(ascending=False).index,
)
plt.suptitle('Top 50 Aircrafts Causing most departure delays due Late Aircraft arrival by Airline Company YU and UA',
fontsize=20.0)
plt.xticks(rotation=90)
plt.xlabel('Aircraft Tail Number')
plt.ylabel('Mean Late Arrival Minutes')
plt.show();
plt.figure(figsize=(10,8))
sb.scatterplot(data=df_wd2,
x='LATE_AIRCRAFT_DELAY',
y='DEP_DELAY',
hue='OP_CARRIER_FL_NUM',
palette='viridis_r',
style='OP_UNIQUE_CARRIER'
# x_jitter=0.3,
# y_jitter=0.3,
# alpha=0.3)
)
plt.legend(title='Flight Number')
plt.suptitle('Distribution of Departure delays and Late Aircraft arrival delay')
plt.xlabel('Late Aircraft Delay in Minutes')
plt.ylabel('Departure Delay in Minutes')
We could clealy say UA is worst performing in flights UA2500 Houston,TX to Settle.WA & UA5000 Newyork to Clevland .¶
YV flight ASH 7500 is the next worst performing next to UA fligths.¶
# create faceted heat maps Day of the week Time of the day and cause for delay for departures
g = sb.FacetGrid(data = df_wa,
col = 'ARR_TIME_BLK',
row = 'DAY_OF_WEEK',
size = 3,
margin_titles = True,
)
plt.xticks(rotation='vertical')
g.map(plt.bar,
x=df_wa[['LATE_AIRCRAFT_DELAY', 'NAS_DELAY','CARRIER_DELAY','WEATHER_DELAY', 'SECURITY_DELAY']].mean().sort_values(ascending=False).index,
height=df_wa[['LATE_AIRCRAFT_DELAY', 'NAS_DELAY','CARRIER_DELAY','WEATHER_DELAY', 'SECURITY_DELAY']].mean().sort_values(ascending=False).values,
color = 'g', )
g.set_xlabels('Delay Categories')
g.set_ylabels('Mean Arrival Delay Minutes')
plt.show()
# Get only records which has NAS Delay catergory
df_wa1 = df_wa[df_wa['NAS_DELAY'] > 0.0 ]
df_wa1.sample()
# Group by destiation Aiport to find the mean NAS Delay by airport
df_wa2 = df_wa1.groupby(by='DEST')['NAS_DELAY'].mean().sort_values(ascending=False)
df_wa2.size
plt.figure(figsize=(15,10));
sb.barplot(df_wa2.index[:50], df_wa2.values[:50], color = 'g')
plt.xticks(rotation=90);
plt.suptitle('Mean NAS Delay Minutes of Aiports with Top 50 NAS Delay Minutes', fontsize=10);
plt.xlabel('Destination Airport');
plt.ylabel('Minutes');
plt.show();
df_wa4 = df_H12019.query("DEST in ['IAG','IMT','ALO','BFM','ATY']").query("NAS_DELAY > 0.0")
df_wa4.sample()
sb.catplot(data=df_wa4,
x='DEST',
y='ARR_DELAY_NEW',
hue='DAY_OF_WEEK',
kind='bar',
height=10,
aspect=2,
palette = 'Blues',
estimator=np.mean,
order=df_wa3.groupby(by='DEST')['ARR_DELAY_NEW'].mean().sort_values(ascending=False).index,
)
plt.suptitle('Top 5 Airports Causing most Arrival delays due to NAS delay by Week day',
fontsize=20)
plt.xlabel('Airport Code', fontsize=20);
plt.ylabel('Arrival Delay in Minutes', fontsize=20);
plt.show();
# Select only top Destination with 40 mins or more Arrival delays
df_wa3 = df_wa1.query("DEST in ['IAG','IMT','ALO','BFM','ATY']")
df_wa3.sample()
sb.catplot(data=df_wa3,
x='DEST',
y='NAS_DELAY',
hue='DAY_OF_WEEK',
kind='bar',
height=10,
aspect=2,
palette = 'Blues',
estimator=np.mean,
order=df_wa3.groupby(by='DEST')['NAS_DELAY'].mean().sort_values(ascending=False).index,
)
plt.suptitle('Top 5 Airports Causing most Arrival delays due to NAS delay for SUN, WED, THU, SAT',
fontsize=20)
plt.xlabel('Airport Code', fontsize=20);
plt.ylabel('Mean Delay in Minutes', fontsize=20);
plt.show();
We have see Departure delays are primarly caused by Late Aircraft Arrival delay. This is attributable to the Airlines.
We could further see that YU and UA are worst performing Airline for departure delays.
We narrowed down to 3 flight numbers which have consistant delayed arrival. One of the routes is Houston to Settel and other is Newyork to Clevland
The Arrival delays are more to do with NAS delays which are due to Airports ATS congession.
IAG ,IMT, ALO, BFM, ATY are 5 most arrival delay causing airports.
Futher breaking down by Day of the week IAG- Niagra falls Airport is congested only on Saturdays. IMT ford airports only on Thurdays. ALO Waterloo airport on Sundays, Wednesday, Thursday and Sundays, but mostly on Wednesdays. BMF airport on Sundays . ATY airport on Saturdays.
There should be reason for Air traffic congestion on these specific day for specific airport. These have to be further investigated to understand the root cause.
IAG and ALO airports are looks to have fairly conjested air traffic, IMT is worst performing with Arrival delays peaking at 350 minutes on saturdays, This will certainly need closer examination.
The Saturdays have most delayed arrival aircrafts and Sundays have next most delayed arrivals
Arrival delays are due to NAS delays and Departure delays are due to late arrival of Aircraft.
Flight number , Aircrafts cause most departure delays. Some destination airports cause extreamly high arrival delays
Carrier delays are the second most prominent cause of departure delays